Problem Context¶

Background and Context

The Thera bank recently saw a steep decline in the number of users of their credit card, credit cards are a good source of income for banks because of different kinds of fees charged by the banks like annual fees, balance transfer fees, and cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.

Customers’ leaving credit cards services would lead bank to loss, so the bank wants to analyze the data of customers and identify the customers who will leave their credit card services and reason for same – so that bank could improve upon those areas

You as a Data scientist at Thera bank need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards

You need to identify the best possible model that will give the required performance.

Problem Statement¶

Objectives

  • Explore and visualize the dataset.
  • Build a classification model to predict if the customer is going to churn
  • Optimize the model using appropriate techniques
  • Generate a set of insights and recommendations that will help the bank

Data Description¶

CLIENTNUM: Client number. Unique identifier for the customer holding the account

Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"

Customer_Age: Age in Years

Gender: Gender of the account holder

Dependent_count: Number of dependents

Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.

Marital_Status: Marital Status of the account holder

Income_Category: Annual Income Category of the account holder

Card_Category: Type of Card

Months_on_book: Period of relationship with the bank

Total_Relationship_Count: Total no. of products held by the customer

Months_Inactive_12_mon: No. of months inactive in the last 12 months

Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months

Credit_Limit: Credit Limit on the Credit Card

Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance

Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)

Total_Trans_Amt: Total Transaction Amount (Last 12 months)

Total_Trans_Ct: Total Transaction Count (Last 12 months)

Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter

Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter

Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Importing the libraries required¶

In [1]:
#Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

#Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#Command to tell python to actually display the graphs
%matplotlib inline

# Supress warnings
import warnings
warnings.filterwarnings('ignore') 

# To split the data into train and test
from sklearn.model_selection import train_test_split

# Supress warnings
import warnings
warnings.filterwarnings('ignore') 

# To split the data into train and test
from sklearn.model_selection import train_test_split

# To be used for missing value imputation
from sklearn.impute import SimpleImputer
# To be used for data encoding
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# To create k folds of data and determine the cross validation score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score,StratifiedKFold

# To undersample and oversample the data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    GradientBoostingClassifier,
    AdaBoostClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To get diferent metric scores
from sklearn import metrics

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

# To be used for creating pipelines and personalizing them
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

Loading the dataset¶

In [2]:
csv_data = pd.read_csv("BankChurners.csv")
data = csv_data.copy()

Overview of the dataset¶

Understand the shape of the dataset

In [3]:
print("The dataset has {} rows and {} columns".format(data.shape[0], data.shape[1]))
The dataset has 10127 rows and 21 columns

View the first and last 5 rows of the dataset

In [4]:
data.head(5)
Out[4]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 ... 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 ... 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 ... 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 ... 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 ... 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000

5 rows × 21 columns

In [5]:
data.tail(5)
Out[5]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 ... 2 3 4003.0 1851 2152.0 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 ... 2 3 4277.0 2186 2091.0 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 ... 3 4 5409.0 0 5409.0 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 ... 3 3 5281.0 0 5281.0 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 ... 2 4 10388.0 1961 8427.0 0.703 10294 61 0.649 0.189

5 rows × 21 columns

Check the data types of the columns

In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

CLIENTNUM could be an index column with all unique values and need to be checked.

2 columns, Education_Level & Marital_Status seem to contain null/missing values.

6 out of 21 columns are of object/string type and the rest are all numeric.

Attrition_Flag is the target/dependent variable.

The memory usage is at 1.6+MB.

In [7]:
#Check to see if CLIENTNUM column is merely an index column with unique values
data['CLIENTNUM'].nunique()
Out[7]:
10127

CLIENTNUM column has all unique values. It will not add any predictive power to the machine learning model, and can hence be dropped.

In [8]:
data.drop('CLIENTNUM', axis=1, inplace=True)
In [9]:
#Confirm that we are now left with 2o columns
data.shape[1]
Out[9]:
20

Duplicates - Now that we have dropped the column with all unique values, lets check the dataset for duplicate rows

In [10]:
print('Found {} duplicate rows'.format(data.duplicated().sum()))
Found 0 duplicate rows

Listing columns with Null Values

In [11]:
data.isnull().sum()
a = data.isnull().sum(axis=0)

b = np.round(data.isnull().sum(axis=0) /data.fillna(0).count(axis=0),2)

missing_data = pd.DataFrame({'missing_vals' : a,
             'missing_ratio' : b})
missing_data = missing_data[missing_data.missing_vals.ne(0)]
print(missing_data)
                 missing_vals  missing_ratio
Education_Level          1519           0.15
Marital_Status            749           0.07

Education_Level and Marital_Status both object data types seem to have 15% and 7% data missing respectively.

Statistical summary of the data¶

In [12]:
data.describe().T
Out[12]:
count mean std min 25% 50% 75% max
Customer_Age 10127.0 46.325960 8.016814 26.0 41.000 46.000 52.000 73.000
Dependent_count 10127.0 2.346203 1.298908 0.0 1.000 2.000 3.000 5.000
Months_on_book 10127.0 35.928409 7.986416 13.0 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.0 3.812580 1.554408 1.0 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.0 2.341167 1.010622 0.0 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.0 2.455317 1.106225 0.0 2.000 2.000 3.000 6.000
Credit_Limit 10127.0 8631.953698 9088.776650 1438.3 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.0 1162.814061 814.987335 0.0 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.0 7469.139637 9090.685324 3.0 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.0 0.759941 0.219207 0.0 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.0 4404.086304 3397.129254 510.0 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.0 64.858695 23.472570 10.0 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.0 0.712222 0.238086 0.0 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.0 0.274894 0.275691 0.0 0.023 0.176 0.503 0.999

In case of Cred_Limit, Avg_Open_To_Buy, mean is higher than standard deviation, depicting right sckewed data.

The wide gap between the 3rd quartile and the max value in the columns Total_Trans_Amt, Avg_Open_To_Buy and Cred_Limit indicate presence of outliers.

Customer_Age looks to be uniform ranging from 26 to 73 with both the mean and median at ~46 .

Dependent_count has values ranging from 0 to 5 with mean and median set to 2.

Months_on_book range from 13 to 56. Looks to be uniform with mean and median at ~36.

Total_Relationship_Count has values ranging from 1 to 6 with both mean and median at 4 indicating a uniform distribution.

Contacts_Count_12_mon has values which span from 0 to 6 with both mean and median at 2.

Total_Revolving_Bal has a range of 0 to 2517 with mean and median around 1200 indicating a near uniform distribution.

Total_Trans_Ct values range from 10 to 139 with mean and median values in the 60s.

Total_Ct_Chng_Q4_Q1 indicates a uniform distribution with mean and median at 0.7

Avg_Utilization_Ratio ranges from 0 to 1 with higher mean than the median indicating right skewed data with outliers.

Categorical column statistics¶

In [13]:
data.describe(exclude=np.number).T
Out[13]:
count unique top freq
Attrition_Flag 10127 2 Existing Customer 8500
Gender 10127 2 F 5358
Education_Level 8608 6 Graduate 3128
Marital_Status 9378 3 Married 4687
Income_Category 10127 6 Less than $40K 3561
Card_Category 10127 4 Blue 9436

Existing Customers are the majority of the dataset.

Females, Graduate educated and married people are the majority of the data.

These customers tend to make less than 40$k a year and hold Blue type credit cards.

Education_Level and Marital_Status both have data missing.

Data Cleansing¶

Let's check the count of each unique category in each of the categorical variables.

In [14]:
# Making a list of all string/categorical variables
string_columns = data.select_dtypes("object").columns.to_list()

# Printing number of count of each unique value in each column
for column in string_columns:
    print(data[column].value_counts())
    print("-" * 40)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
----------------------------------------
F    5358
M    4769
Name: Gender, dtype: int64
----------------------------------------
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
----------------------------------------
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
----------------------------------------
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
----------------------------------------
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
----------------------------------------

Income_Category column seems to have errorneous data. We will have to replace 'abc' with null values and handle it as part of the imputation process (which should be done after splitting the data in order to avoid Data Leakage).

There are only 20 observations of Platinum Card holders.

We can see that the distribution of classes in the target variable is imbalanced - 8500 existing customers vs 1600+ attrited customers.

In [15]:
# subset to view incorrect values
data[data.Income_Category == "abc"]
Out[15]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
19 Existing Customer 45 F 2 Graduate Married abc Blue 37 6 1 2 14470.0 1157 13313.0 0.966 1207 21 0.909 0.080
28 Existing Customer 44 F 3 Uneducated Single abc Blue 34 5 2 2 10100.0 0 10100.0 0.525 1052 18 1.571 0.000
39 Attrited Customer 66 F 0 Doctorate Married abc Blue 56 5 4 3 7882.0 605 7277.0 1.052 704 16 0.143 0.077
44 Existing Customer 38 F 4 Graduate Single abc Blue 28 2 3 3 9830.0 2055 7775.0 0.977 1042 23 0.917 0.209
58 Existing Customer 44 F 5 Graduate Married abc Blue 35 4 1 2 6273.0 978 5295.0 2.275 1359 25 1.083 0.156
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10021 Attrited Customer 30 F 1 Graduate Married abc Blue 18 4 1 4 4377.0 2517 1860.0 0.941 8759 74 0.609 0.575
10040 Attrited Customer 50 F 3 Doctorate Single abc Blue 36 4 3 3 5173.0 0 5173.0 0.912 8757 68 0.789 0.000
10083 Existing Customer 42 F 4 Uneducated Married abc Blue 23 4 1 2 8348.0 0 8348.0 0.695 15905 111 0.708 0.000
10092 Attrited Customer 40 F 3 Graduate Married abc Blue 25 1 2 3 6888.0 1878 5010.0 1.059 9038 64 0.829 0.273
10119 Attrited Customer 55 F 3 Uneducated Single abc Blue 47 4 3 3 14657.0 2517 12140.0 0.166 6009 53 0.514 0.172

1112 rows × 20 columns

We will replace the errorneous value with null values. In order to avoid data leakage, we will impute the missing values after the data has been split.

In [16]:
# Replacing errorneous values in the Income_Category column with null values
data.Income_Category.replace(to_replace="abc", value=np.nan, inplace=True)
In [17]:
# Verify value replacement
data.Income_Category.value_counts()
Out[17]:
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
$120K +            727
Name: Income_Category, dtype: int64

Datatype Conversions¶

In [18]:
#Converting the data type of the category variables from object/float to category
data[string_columns] = data[string_columns].astype('category')
In [19]:
#Verifying the conversion and also checking the memory footprint
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64   
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           9015 non-null   category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  int64   
 10  Months_Inactive_12_mon    10127 non-null  int64   
 11  Contacts_Count_12_mon     10127 non-null  int64   
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB

We now have 6 categorical columns and the memory footprint has come down from ~1.6MB to ~1.1MB.

Exploratory Data Analysis¶

Univariate Analysis

Numerical Feature Analysis

The first step of univariate analysis is to check the distribution/spread of the data. This is done using primarily histograms and box plots.

Let us go ahead and plot histograms and box plots for all the numerical columns.

In [20]:
numerical_col = data.select_dtypes(exclude='category').columns.tolist()
for col in numerical_col:
  sns.histplot(data=data, x=col, kde=True);
  plt.show()

Observations from Numerical Feature Analysis

Customer_Age is normally distributed.

Most customers have 3 dependents followed by customers with 2 and 1 dependents.

Months_on_book - Data looks well distributed on both sides. There are customers who are assocated with bank for almost 5 years.

Total_Relationship_Count - Around 1/4th of customers have 3 relationships with bank. Customers with 4,5 and 6 relationships are almost equal in numbers.

One third of the customers have not used the card for 3 months in last 12 months. There are a 1% subset of customers who have not used their cards for 6 months in last 12 months.

~30% customers had interacted 3 times with bank staff in last 12 months, followed by customers interacted twice. Very small percentage of customers have interacted for 6 times in last 12 months.

From our domain knowledge we know that some customers will have higher credit limit compared to others. The data is expected to be right skewed given the domain.looks

Total_Revolving_Bal - The data is skewed to the left. From the domain knowledge we know that customers may or may not use the revolving credit / balance feature on their cards.

Avg_Open_To_Buy - The data is heavily right skewed. From our domain knowledge, we know that customers may or may not utilize the entire limit available on card. The data looks normal for the domain.

Total_Amt_Chng_Q4_Q1 - The data is skewed to the right. The variable represents a comparision between utilisation in Q4 as compared to Q1. The customers spending patterns change and so the skew is quite expected and not surprising or abnormal.

Total_Trans_Amt - Some customers are likely to spend more on credit card than others. The data is skewed to the right with 4 peaks/modes. The skewing in this column is very much expected.

Total_Trans_Ct - The data largely looks balanced with 2 peaks. Some customers use their credit cards more than others and we cannot expect a ideal normal distribution for this column.

Total_Count_Chng_Q4_Q1 - This graph looks very similar to that of Total_Amt_Chng_Q4_Q1. They both are skewed to the right, peak at the same point. The min and max values listed in the statistical summary are very close as well. Could be because both these columns are rations.

Avg_Utilization_Ratio - The data is skewed to the right. Given that some customers are likely to spend more on credit card, the skewing is expected.

Outlier detection using Boxplot¶

In [21]:
#Let us plot boxplots for all the numerical columns
plt.figure(figsize=(25,5))

for i, variable in enumerate(numerical_col):
    plt.subplot(1,14,i+1)
    sns.boxplot(y=data[variable], orient='h')
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observations on Outliers derived from Box Plots

Customer_Age has outliers. From the statistical summary, the max value in the column is only 73. So no treatment is required.

Months_on_book - There are outliers on both sides. As per the statistical summary, the max time that the customer was associated with the bank is ~5 years and the minimum is ~1 year. Knowing the domain, this is quite expected and does not need any outlier treatment. We might want to confirm that the customers who have been with the bank longer are older customers.

Months_Inactive_12_mon - There are outliers on both ends. However looking at the statistical summary the range is very small - 0 to 6 months. Again knowing the domain, this is quite expected.

Contacts_Count_12_mon - Outliers exist on both ends. However looking at the statistical summary the range is very small - 0 to 6. Again knowing the domain, this is quite expected and does not call for any outlier treatment.

Credit_Limit - Outliers at the higher end. From our domain knowledge we know that some customers will have higher credit limit compared to others. Again, we could validate that the customers with higher limit are older as fresh grads won't start out with a high credit limit as it takes a while to build up the credit history.

Avg_Open_To_Buy - Open to Buy refers to the amount left on the credit card to use (Average of last 12 months). This column has outliers on the higher side. From our domain knowledge, we know that customers may or may not utilize the entire limit available on card. However we can double check that the customers with high Avg_Open_To_Buy have higher Credit_Limit and are older.

Total_Amt_Chng_Q4_Q1 - Outliers exist on either ends. The variable represents a comparision between utilisation of Q4 as compared to Q1. The statistical summary indicates that the values range between 0 to 3.4 - a very small range. Given the domain, this is expected where the same amount wont be spent every quarter - many factors contribute to the amount spent. The data looks legit/expected and no outlier treatment is required.

Total_Trans_Amt (in 12 months) - Outliers exist on the higher end. Some customers are likely to spend more on credit card. We can verify that the customers who spend more have a higher credit limit.

Total_Trans_Ct (in 12 months) - Outliers exist on the higher end. It is again expected that some customers use their card more than others. As per the statistical summary, the values range from 10 to 139. Again this is expected and perhaps we can verify that the customers who use the card the most incur more charges... This isn't necessarily true either - higher transaction count does not have to add up to higher transaction amounts. For example, 10 trips to the grocery store (10 transactions) wont add up to one roundtrip airfare for the family(1 transaction).

Total_Count_Chng_Q4_Q1 - Outliers exist on either ends. The variable represents a comparision between number of transactions in Q4 as compared to Q1. The statistical summary indicates that the values range between 0 to 3.7 - a very small range. Given the domain, this is expected most users wont use the credit card the exact same number of times within a quarter - many factors contribute to the amount spent. The data looks legit/expected and no outlier treatment is required.

Categorical Feature Analysis¶

In [22]:
# Below code draws count plots for each categorical feature along with the percentages

def perc_on_bar(data: pd.DataFrame, category_columns, target, hue=None, perc=True):
    '''
    The function takes a category column as input and plots bar chart with percentages on top of each bar
    '''
    
    subplot_cols = 2
    subplot_rows = int(len(category_columns)/2 + 1)
    plt.figure(figsize=(16,5*subplot_rows))
    for i, col in enumerate(category_columns):
        plt.subplot(subplot_rows,subplot_cols,i+1)
        order = data[col].value_counts(ascending=False).index  # Data order  
        ax=sns.countplot(data=data, x=col, palette = 'crest', order=order, hue=hue);
        for p in ax.patches:
            percentage = '{:.1f}%\n({})'.format(100 * p.get_height()/len(data[target]), p.get_height())
            # Added percentage and actual value
            x = p.get_x() + p.get_width() / 2
            y = p.get_y() + p.get_height() + 40
            if perc:
                plt.annotate(percentage, (x, y), ha='center', color='black', fontsize='medium'); # Annotation on top of bars
            plt.xticks(color='black', fontsize='medium', rotation= (-90 if col=='region' else 0));
            plt.tight_layout()
            plt.title(col.capitalize() + ' Percentage Count Plots \n\n')
In [23]:
perc_on_bar(data, string_columns, 'Attrition_Flag')

Observations from Categorical Feature Analysis

Target variable shows 83.9% customers are existing and 16.1% are attrited.

Number of Female customers is higher than Male customers.

Most customers are Graduate followed by High School.

Around 46% customers are married.

35% of them make less than 40K followed by ~18% of them who make between 40 to 60K.

Most customers in the dataset have "Blue" Card. There are very few customers who use "Gold" and "Platinum" Cards.

Bivariate Analysis¶

Goal of Bi-variate analysis is to find inter-dependencies between features.

In [24]:
#Let us draw count plots of the categorical variables grouped by the dependent/target variable (Attrition_Flag)
for col in string_columns:
  plt.figure(figsize=(15,5))
  sns.countplot(data=data, x=col, hue='Attrition_Flag');
  plt.show()

Observations drawn from the Bivariate Analysis of the Categorical variables

Number of Male Customers closing the Credit Card account is slighlty lesser than Female Customers.

People with Doctorate degree tend to leave the Credit Card program more than other Education_Levels. Education_Level other than Doctorate do not show significant impact on attrition.

Married Customers are less likely to close the credit card compared to other Statuses.

Customers with income greater than 120k and less than 40k are more likely to close the credit card. Customers within income range of 60k and 80k are less likely to attrite.

Platinum card members are most likely to close the credit card compared to other categories.

In [25]:
# #Let us draw box plots of the numerical columns against the dependent/target variable (Attrition_Flag)
plt.figure(figsize=(20,30))
for i, variable in enumerate(numerical_col):
                     plt.subplot(5,3,i+1)
                     sns.boxplot(data["Attrition_Flag"],data[variable],palette="Set1")
                     plt.tight_layout()
                     plt.title(variable)
plt.show()

Observations drawn from the Bivariate Analysis of the Numerical variables

Existing customers have slightly larger Credit_Limits.

Existing customers have larger Revolving_Balance.

Transaction_Amounts and transaction counts of existing customers are higher than attrited customers, which indicates some customers may be using other credit cards or really don't use credit cards at all.

Average Utilisation is higher in existing customers.

The number of times the bank staff contacted the attrited customers is higher than the existing customers.

Customer_Age, Avg_to_Buy, Months_Inactive_12_mon and Months_on_Book do not seem to have an impact on attrition.

Multi-variate Analysis¶

Pair Plot of the numerical columns hued by Attrition_Flag

In [26]:
## Below plot shows correlations between the numerical features in the dataset

plt.figure(figsize=(20,20));
sns.set(palette="nipy_spectral");
sns.pairplot(data=data, hue='Attrition_Flag', corner=True);
plt.show()
<Figure size 1440x1440 with 0 Axes>

Observations As we have seen during bivariate analysis there are no clear indicators of attrition as all variables show similar data for both existing customers and attrited customers.

Heatmap to plot the correlations between the numerical columns.¶

In [27]:
plt.figure(figsize=(15,5))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, cmap='Spectral');

Observations

Avg_Open_To_Buy is highly correlated to Credit_Limit.

Total_Trans_Amt is highly correlated to Total_Trans_Ct as they are value and volume representation of Transactional Data.

Avg_Utilization_Ratio shows strong correlation with Total_Revolving_Bal.

Customer_Age as expected is strongly correlated to Months_on_book.

Avg_Utilization_Ratio is negatively correlated to Credit_Limit and Avg_Open_To_Buy.

Outlier Treatment¶

When analyzing the outliers indicated by the box plots, we noted 5 columns for further analysis. Recap of the observations of the 5 columns below.

Months_on_book - There are outliers on both sides. As per the statistical summary, the max time that the customer was associated with the bank is ~5 years and the minimum is ~1 year. Knowing the domain, this is quite expected. We could check if the customers who have higher values are older customers.

Credit_Limit - Outliers at the higher end. From our domain knowledge we know that some customers will have higher credit limit compared to others. Again, we could validate that the customers with higher limit are older as fresh grads won't start out with a high credit limit as it takes a while to build up the credit history.

Avg_Open_To_Buy - Open to Buy refers to the amount left on the credit card to use (Average of last 12 months). This column has outliers on the higher side. From our domain knowledge, we know that customers may or may not utilize the entire limit available on card. However we can double check that the customers with high Avg_Open_To_Buy have higher Credit_Limit and are older.

Total_Trans_Amt (in 12 months) - Outliers exist on the higher end and that is justifiable as some customers are likely to spend more than others. We can verify that the customers who spend more have a higher credit limit.

Total_Trans_Ct (in 12 months) - Outliers exist on the higher end. It is again expected that some customers use their card more than others. As per the statistical summary, the values range from 10 to 139. Again this is expected and perhaps we can verify that the customers who use the card the most incur more charges. This doesn't hold good in every case - higher transaction count does not have to add up to higher transaction amounts. For example, 10 trips to the grocery store (10 transactions) wont add up to one roundtrip airfare for the family(1 transaction).

Function to determine the lower and upper limits based on the Quantile and IQR

In [28]:
#Determine the lower_limit and upper_limit (Formulas: Q1-1.5*IQR and Q3+1.5*IQR respectively)
def calculate_limits(data,colmn):
  # Q1 = 25th quantile
    Q1=data[colmn].quantile(0.25)
    
    # Q3 = 75th quantile
    Q3=data[colmn].quantile(0.75)
    
    # Calculating IQR
    IQR=Q3-Q1
    
    # Setting lower limit to Q1-1.5*IQR
    lower_limit = Q1 - 1.5*IQR
    if lower_limit < 0:
        lower_limit = 0

    # Setting upper limit to Q3+1.5*IQR
    upper_limit = Q3 + 1.5*IQR
    
    return lower_limit, upper_limit
In [29]:
# Let us calculate the lower and upper limits of Months_on_book using quantiles.
lower_limit, upper_limit = calculate_limits(data,'Months_on_book')
print ('Lower cap for Months_on_book',lower_limit)
print ('Upper cap for Months_on_book', upper_limit)
Lower cap for Months_on_book 17.5
Upper cap for Months_on_book 53.5
In [30]:
#Let us get the number of rows where Months_on_book>53.5 & let us see what values those rows hold.
temp = data[data['Months_on_book']>53.5]
print('Number of rows with more than 53.5 months_on_book:', temp.shape[0])
temp['Months_on_book'].value_counts(1)
Number of rows with more than 53.5 months_on_book: 198
Out[30]:
56    0.520202
54    0.267677
55    0.212121
Name: Months_on_book, dtype: float64

None of the 198 rows have a Month_on_book that is significantly higher than the upper limit. There are merely 1 to 2.5 more than the upper limit.

In [31]:
#Let us check the age of the customers who have higher values in Months_on_book
#Most likely older clients are the ones who have been with the bank for longer compared to customers in 20s.
temp['Customer_Age'].value_counts()
Out[31]:
65    59
63    25
61    24
62    24
60    23
64    21
59     8
58     7
67     3
66     2
68     1
70     1
Name: Customer_Age, dtype: int64
In [32]:
#Let us get the number of rows where Months_on_book<17.5 & let us see what values those rows hold.
temp = data[data['Months_on_book']<17.5]
print('Number of rows with less than 17.5 in months_on_book:', temp.shape[0])
temp['Months_on_book'].value_counts(1)
Number of rows with less than 17.5 in months_on_book: 188
Out[32]:
13    0.372340
17    0.207447
15    0.180851
16    0.154255
14    0.085106
Name: Months_on_book, dtype: float64

Majority of these customers have been with the bank for atleast 13 months (a year). Also customers new to the bank do not have to be younger as they could be customers who recently transfered from a different bank. So age is not a relevant check for the customers with Months_on_book below 17.5.

Bottom Line - Outlier Treatment on Months_on_book

Given the domain, these outliers are expected and do not warranty any treatment.

In [33]:
# Let us calculate the lower and upper limits of Credit_limit using quantiles.
lower_limit, upper_limit = calculate_limits(data,'Credit_Limit')
print ('Lower cap for Credit_limit',lower_limit)
print ('Upper cap for Credit_limit', upper_limit)
Lower cap for Credit_limit 0
Upper cap for Credit_limit 23836.25
In [34]:
#Let us get the number of rows where Credit_Limit > 24K & let us see what values those rows hold.
temp = data[data['Credit_Limit']>24000]
print('Number of rows with more than 24k in Credit_Limit:', temp.shape[0])
temp['Credit_Limit'].value_counts(1)
Number of rows with more than 24k in Credit_Limit: 958
Out[34]:
34516.0    0.530271
31501.0    0.002088
30210.0    0.002088
26792.0    0.002088
32535.0    0.002088
             ...   
32676.0    0.001044
32446.0    0.001044
28634.0    0.001044
32658.0    0.001044
29663.0    0.001044
Name: Credit_Limit, Length: 442, dtype: float64

53% of these customers have a Credit_Limit of 34K. Which is good 10K above the upper_limit, however this is quite expected given the domain.

In [35]:
#Let us take a look at the age of these customers to make sure there aren't many in the 20s. As it takes a while to build up
#credit history and get a good credit limit.
temp['Customer_Age'].value_counts()
Out[35]:
51    60
44    58
49    57
50    57
46    52
52    50
43    50
48    49
41    47
47    46
45    44
42    43
39    36
40    34
38    33
55    33
53    29
57    24
54    23
56    14
36    12
34    12
58    11
37    11
60    10
31    10
33     9
35     9
32     8
59     7
62     6
61     3
63     2
28     2
29     2
26     2
30     1
65     1
27     1
Name: Customer_Age, dtype: int64

Out of 958 customers we only got about 9 of them that are between 26 and 29. Again expected and this is less than 1% of the customer base that has a high credit limit.

Bottom Line - Credit_Limit - Outlier Treatment

The outliers do not warranty any treatment.

In [36]:
# Let us calculate the lower and upper limits of Avg_Open_To_Buy using quantiles.
lower_limit, upper_limit = calculate_limits(data,'Avg_Open_To_Buy')
print ('Lower cap for Avg_Open_To_Buy',lower_limit)
print ('Upper cap for Avg_Open_To_Buy', upper_limit)
Lower cap for Avg_Open_To_Buy 0
Upper cap for Avg_Open_To_Buy 22660.75
In [37]:
#Let us get the number of rows where Avg_Open_To_Buy > 23K & let us see what values those rows hold.
temp = data[data['Avg_Open_To_Buy']>23000]
print('Number of rows with more than 23k in Avg_Open_To_Buy:', temp.shape[0])
temp['Avg_Open_To_Buy'].value_counts(1)
Number of rows with more than 23k in Avg_Open_To_Buy: 941
Out[37]:
34516.0    0.104145
31999.0    0.027630
32669.0    0.003188
32864.0    0.003188
23981.0    0.003188
             ...   
32623.0    0.001063
28701.0    0.001063
24024.0    0.001063
30858.0    0.001063
33425.0    0.001063
Name: Avg_Open_To_Buy, Length: 760, dtype: float64
In [38]:
#Let us make sure there aren't any rows where Avg_Open_To_Buy is greater than the allowed credit limit. 
#Existance of such rows indicate erroneous data.
temp1 = temp[temp['Avg_Open_To_Buy'] > temp['Credit_Limit']]
temp1.shape[0]
Out[38]:
0
In [39]:
#Let us confirm that majority of these customers have a high credit limit. 
temp['Credit_Limit'].value_counts(1)
Out[39]:
34516.0    0.539851
23981.0    0.004251
30210.0    0.002125
26437.0    0.002125
32535.0    0.002125
             ...   
31945.0    0.001063
27710.0    0.001063
26856.0    0.001063
26218.0    0.001063
29663.0    0.001063
Name: Credit_Limit, Length: 423, dtype: float64

53% of the customers have a high credit limit of 34.5K. A good indicator that these are legit observations.

Bottom Line - Avg_Open_To_Buy - Outlier Treatment

The outliers do not warranty any treatment.

In [40]:
# Let us calculate the lower and upper limits of Total_Trans_Amt using quantiles.
lower_limit, upper_limit = calculate_limits(data,'Total_Trans_Amt')

print ('Lower cap for Total_Trans_Amt',lower_limit)
print ('Upper cap for Total_Trans_Amt', upper_limit)
Lower cap for Total_Trans_Amt 0
Upper cap for Total_Trans_Amt 8619.25
In [41]:
#Let us get the number of rows where Total_Trans_Amt> 9K & let us see what values those rows hold.
temp = data[data['Total_Trans_Amt']>9000]
print('Number of rows with more than 9k in Total_Trans_Amt:', temp.shape[0])
temp['Total_Trans_Amt'].value_counts(1)
Number of rows with more than 9k in Total_Trans_Amt: 819
Out[41]:
14738    0.003663
14802    0.003663
14220    0.003663
13988    0.003663
13360    0.002442
           ...   
13299    0.001221
9177     0.001221
13853    0.001221
13958    0.001221
10294    0.001221
Name: Total_Trans_Amt, Length: 747, dtype: float64
In [42]:
#Let us verify that these rows have optimal values in the Total_Trans_Ct columns. 
temp['Total_Trans_Ct'].min()
Out[42]:
50

This is total amount spent over a period of 12 months. So quite natural to have a wide range of values. The same goes for Total_Trans_Ct column as well

Bottom Line - Total_Trans_Amt & Total_Trans_Ct - Outlier Treatment

The outliers do not warranty any treatment.

Customer Profiling¶

Let us create a dataset with just the attrited customers.

In [43]:
temp = data.copy()
temp['Attrition_Flag'] = temp['Attrition_Flag'].replace({"Attrited Customer": 1, "Existing Customer": 0})
temp = temp[temp['Attrition_Flag'] ==1]
temp.describe().T
Out[43]:
count mean std min 25% 50% 75% max
Customer_Age 1627.0 46.659496 7.665652 26.0 41.0000 47.000 52.000 68.000
Dependent_count 1627.0 2.402581 1.275010 0.0 2.0000 2.000 3.000 5.000
Months_on_book 1627.0 36.178242 7.796548 13.0 32.0000 36.000 40.000 56.000
Total_Relationship_Count 1627.0 3.279656 1.577782 1.0 2.0000 3.000 5.000 6.000
Months_Inactive_12_mon 1627.0 2.693301 0.899623 0.0 2.0000 3.000 3.000 6.000
Contacts_Count_12_mon 1627.0 2.972342 1.090537 0.0 2.0000 3.000 4.000 6.000
Credit_Limit 1627.0 8136.039459 9095.334105 1438.3 2114.0000 4178.000 9933.500 34516.000
Total_Revolving_Bal 1627.0 672.822987 921.385582 0.0 0.0000 0.000 1303.500 2517.000
Avg_Open_To_Buy 1627.0 7463.216472 9109.208129 3.0 1587.0000 3488.000 9257.500 34516.000
Total_Amt_Chng_Q4_Q1 1627.0 0.694277 0.214924 0.0 0.5445 0.701 0.856 1.492
Total_Trans_Amt 1627.0 3095.025814 2308.227629 510.0 1903.5000 2329.000 2772.000 10583.000
Total_Trans_Ct 1627.0 44.933620 14.568429 10.0 37.0000 43.000 51.000 94.000
Total_Ct_Chng_Q4_Q1 1627.0 0.554386 0.226854 0.0 0.4000 0.531 0.692 2.500
Avg_Utilization_Ratio 1627.0 0.162475 0.264458 0.0 0.0000 0.000 0.231 0.999
In [44]:
temp.describe(exclude=np.number).T
Out[44]:
count unique top freq
Attrition_Flag 1627 1 1 1627
Gender 1627 2 F 930
Education_Level 1371 6 Graduate 487
Marital_Status 1498 3 Married 709
Income_Category 1440 5 Less than $40K 612
Card_Category 1627 4 Blue 1519

Attrition Indicators

The attrited customers ---->

are primarily in the age group of 41 to 52

have been with the bank between 2.5 to 3.5 years

have not used the credit card for 2 to 3 months in the last 12 months

have a credit limit between ~2K and ~10K

have a revolving balance between 0 and 1.3K

spend anywhere between ~1.9K to ~2.7K annually

use the credit card between 37 to 51 times annually

majority of them are married female graduates with income less than 40K.

Data Preparation for Modeling¶

In [45]:
#Encoding the target variable
data['Attrition_Flag'] = data['Attrition_Flag'].replace({"Attrited Customer": 1, "Existing Customer": 0})
In [46]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64   
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           9015 non-null   category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  int64   
 10  Months_Inactive_12_mon    10127 non-null  int64   
 11  Contacts_Count_12_mon     10127 non-null  int64   
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB
In [47]:
data.head(5)
Out[47]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 0 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061
1 0 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105
2 0 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000
3 0 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760
4 0 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000
In [48]:
X = data.drop('Attrition_Flag', axis=1)
y = data['Attrition_Flag']
In [49]:
data['Attrition_Flag'].value_counts(1)
Out[49]:
0    0.83934
1    0.16066
Name: Attrition_Flag, dtype: float64

84% of observations belong to existing customers (mapped to 0) and 16% of the observations belong to attrited customers (mapped to 1), and this ratio will have to be preserved in the training, validation and the testing datasets.

Splitting data into 3 sets - Training, Validation and Testing.¶

In [50]:
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)

# Then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp,
    y_temp,
    test_size=0.25,
    random_state=1,
    stratify=y_temp,  # set the weighting feature on
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)

Verifying that the ratio of the classes in the target variable has been preserved across the 3 datasets.

In [51]:
y_train.value_counts(1)
Out[51]:
0    0.839342
1    0.160658
Name: Attrition_Flag, dtype: float64
In [52]:
y_val.value_counts(1)
Out[52]:
0    0.839092
1    0.160908
Name: Attrition_Flag, dtype: float64
In [53]:
y_test.value_counts(1)
Out[53]:
0    0.839585
1    0.160415
Name: Attrition_Flag, dtype: float64

Imputation¶

In order to avoid Data Leakage, Imputation is being done after the data has been split into Train, Validation and Testing datasets.

In [54]:
#The 3 columns with missing data are categorical and hence using the mode(most-frequent) value to fill in the missing values.
cols_to_impute = ['Education_Level','Income_Category','Marital_Status']
imputer = SimpleImputer(strategy='most_frequent')

# fit the imputer on the training set and transform it on all the 3 - training, validation and testing datasets.
X_train[cols_to_impute] = imputer.fit_transform(X_train[cols_to_impute])
X_val[cols_to_impute] = imputer.transform(X_val[cols_to_impute])
X_test[cols_to_impute] = imputer.transform(X_test[cols_to_impute])
In [55]:
# Verifiying that no column has missing values in training, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64

Ordinal Encoding¶

We have 3 categorical columns (Card_Category, Education_Level and Income_Category) where ranking/order matters.

Listing the unique values in the 3 columns to come up with the ranking/order.

In [56]:
X_train['Card_Category'].unique()
Out[56]:
['Blue', 'Silver', 'Gold', 'Platinum']
Categories (4, object): ['Blue', 'Gold', 'Platinum', 'Silver']
In [57]:
X_train['Education_Level'].unique()
Out[57]:
array(['Graduate', 'High School', 'Uneducated', 'College', 'Doctorate',
       'Post-Graduate'], dtype=object)
In [58]:
X_train['Income_Category'].unique()
Out[58]:
array(['$120K +', 'Less than $40K', '$80K - $120K', '$60K - $80K',
       '$40K - $60K'], dtype=object)
In [59]:
# Create order categories to instruct how ordinal encoder should rank the values.
eduction_order = ['Uneducated','High School','College','Graduate','Post-Graduate','Doctorate']
card_order = ["Blue", "Silver", "Gold", 'Platinum']
income_order = ['Less than $40K','$40K - $60K', '$60K - $80K','$80K - $120K', '$120K +']

# Assign attributes to different lists based on the values
column_to_cat = {
    "Education_Level": eduction_order,
    "Card_Category": card_order,
    "Income_Category": income_order
    
}

# Initialize the 3 encoders - if we encounter values in validation or testing data sets
# that do not exist in the training set, the encoder is configured to raise an error (handle_unknown="error") 
encoder_education = OrdinalEncoder(categories=[eduction_order],handle_unknown="error")
encoder_card = OrdinalEncoder(categories=[card_order],handle_unknown="error")
encoder_income = OrdinalEncoder(categories=[income_order],handle_unknown="error")

#Fit the encoder on the training set and transform it on all the 3 datasets - training, validation and testing.
X_train['Education_Level'] = encoder_education.fit_transform(X_train[['Education_Level']])
X_train['Card_Category'] = encoder_card.fit_transform(X_train[['Card_Category']])
X_train['Income_Category'] = encoder_income.fit_transform(X_train[['Income_Category']])

X_val['Education_Level'] = encoder_education.transform(X_val[['Education_Level']])
X_val['Card_Category'] = encoder_card.transform(X_val[['Card_Category']])
X_val['Income_Category'] = encoder_income.transform(X_val[['Income_Category']])

X_test['Education_Level'] = encoder_education.transform(X_test[['Education_Level']])
X_test['Card_Category'] = encoder_card.transform(X_test[['Card_Category']])
X_test['Income_Category'] = encoder_income.transform(X_test[['Income_Category']])

Verifiying that the 3 columns have been encoded.¶

In [60]:
X_test['Card_Category'].value_counts()
Out[60]:
0.0    1876
1.0     119
2.0      26
3.0       5
Name: Card_Category, dtype: int64

value_counts lists the grouped values in the descending order, so most frequent value is listed first. Blue cards formed the majority and we wanted it assigned the lowest order.

In [61]:
X_train['Education_Level'].value_counts()
Out[61]:
3.0    2782
1.0    1228
0.0     881
2.0     618
4.0     312
5.0     254
Name: Education_Level, dtype: int64

value_counts lists the grouped values in the descending order, so most frequent value is listed first. Graduates formed the majority of the customer base and we wanted it ranked after Uneducated(ranking=0), High School(ranking=1) and College(ranking=2)

In [62]:
X_val['Income_Category'].value_counts()
Out[62]:
0.0    957
1.0    361
3.0    293
2.0    279
4.0    136
Name: Income_Category, dtype: int64

value_counts lists the grouped values in the descending order, so most frequent value is listed first. Customers making less than $40k formed the majority and we ranked this the lowest.

Taking a look at the updated column dtype composition of the dataset. The dataset should now have 2 non-numeric columns.

In [63]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6075 entries, 800 to 4035
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Customer_Age              6075 non-null   int64   
 1   Gender                    6075 non-null   category
 2   Dependent_count           6075 non-null   int64   
 3   Education_Level           6075 non-null   float64 
 4   Marital_Status            6075 non-null   object  
 5   Income_Category           6075 non-null   float64 
 6   Card_Category             6075 non-null   float64 
 7   Months_on_book            6075 non-null   int64   
 8   Total_Relationship_Count  6075 non-null   int64   
 9   Months_Inactive_12_mon    6075 non-null   int64   
 10  Contacts_Count_12_mon     6075 non-null   int64   
 11  Credit_Limit              6075 non-null   float64 
 12  Total_Revolving_Bal       6075 non-null   int64   
 13  Avg_Open_To_Buy           6075 non-null   float64 
 14  Total_Amt_Chng_Q4_Q1      6075 non-null   float64 
 15  Total_Trans_Amt           6075 non-null   int64   
 16  Total_Trans_Ct            6075 non-null   int64   
 17  Total_Ct_Chng_Q4_Q1       6075 non-null   float64 
 18  Avg_Utilization_Ratio     6075 non-null   float64 
dtypes: category(1), float64(8), int64(9), object(1)
memory usage: 907.8+ KB

Gender and Marital_Status columns are still in the non-numeric format and need to be one-hot encoded. Note that SimpleImputer seems to change the dtype of a column(Marital_Status) from Category to Object.

One Hot Encoding¶

Applying onehot encoding to Gender and Marital_Status columns. We will not drop any of the resulting columns to accomodate any values that exist in the validation/testing sets but not in the training sets.

In [64]:
X_train.columns
Out[64]:
Index(['Customer_Age', 'Gender', 'Dependent_count', 'Education_Level',
       'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')
In [65]:
# Create a OneHotEncoder object
onehot_encoder = OneHotEncoder(sparse=False, handle_unknown="ignore") #setting sparse to False in order to get back an array
cols_to_ohe = ['Gender','Marital_Status']

#Fit the encoder on the training set and transform it on all the 3 datasets - training, validation and testing
encoded_columns_train = onehot_encoder.fit_transform(X_train[cols_to_ohe])
# Converting the array into a dataframe
data_train_encoded = pd.DataFrame(
   encoded_columns_train,
   index=X_train.index,
   columns=onehot_encoder.get_feature_names_out(cols_to_ohe),
)
# Extract the columns that were not sent through the onehot encoder
data_other_cols_train = X_train.drop(columns=cols_to_ohe)
# Recreate the dataframe
X_train = pd.concat([data_train_encoded, data_other_cols_train], axis=1)

# Transform the validation set
encoded_columns_validation = onehot_encoder.transform(X_val[cols_to_ohe])
# Convert the array into a dataframe
data_val_encoded = pd.DataFrame(
   encoded_columns_validation,
   index=X_val.index,
   columns=onehot_encoder.get_feature_names_out(cols_to_ohe),
)
# Extract the columns that were not sent through the onehot encoder
data_other_cols_validation = X_val.drop(columns=cols_to_ohe)
# Recreate the dataframe
X_val = pd.concat([data_val_encoded, data_other_cols_validation], axis=1)


# Transform the test set
encoded_columns_test = onehot_encoder.transform(X_test[cols_to_ohe])
# Convert the array into a dataframe
data_test_encoded = pd.DataFrame(
   encoded_columns_test,
   index=X_test.index,
   columns=onehot_encoder.get_feature_names_out(cols_to_ohe),
)
# Extract the columns that were not sent through the onehot encoder
data_other_cols_test = X_test.drop(columns=cols_to_ohe)
# Recreate the dataframe
X_test = pd.concat([data_test_encoded, data_other_cols_test], axis=1)
In [66]:
#Let us take a look at the size of the dataset across the 3 partitions - we should have 3 more columns
#as Gender column has 2 unique values and Marital_Status had 3 unique values. So OneHotEncoder splits the
#Gender column into 2 and the Marital_Status column into 3. So that leaves us with 3 more columns in number.
#Neither of the encoding techniques (ordinal or onehot) should make any changes to the number of rows.
print('Training data set has {} rows and {} columns'.format(X_train.shape[0],X_train.shape[1]))
print('Validation data set has {} rows and {} columns'.format(X_val.shape[0],X_val.shape[1]))
print('Test data set has {} rows and {} columns'.format(X_test.shape[0],X_test.shape[1]))
Training data set has 6075 rows and 22 columns
Validation data set has 2026 rows and 22 columns
Test data set has 2026 rows and 22 columns

Let us inspect the columns in the dataset.

In [67]:
X_train.columns
Out[67]:
Index(['Gender_F', 'Gender_M', 'Marital_Status_Divorced',
       'Marital_Status_Married', 'Marital_Status_Single', 'Customer_Age',
       'Dependent_count', 'Education_Level', 'Income_Category',
       'Card_Category', 'Months_on_book', 'Total_Relationship_Count',
       'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit',
       'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1',
       'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
       'Avg_Utilization_Ratio'],
      dtype='object')

Gender and Marital_Status columns are replaced by Gender_F, Gender_M, Marital_Status_Divorced, Marital_Status_Married and Marital_Status_Single.

In [68]:
X_train.head(5)
Out[68]:
Gender_F Gender_M Marital_Status_Divorced Marital_Status_Married Marital_Status_Single Customer_Age Dependent_count Education_Level Income_Category Card_Category ... Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
800 0.0 1.0 0.0 0.0 1.0 40 2 3.0 4.0 0.0 ... 4 3 20056.0 1602 18454.0 0.466 1687 46 0.533 0.080
498 0.0 1.0 0.0 1.0 0.0 44 1 3.0 0.0 0.0 ... 2 0 2885.0 1895 990.0 0.387 1366 31 0.632 0.657
4356 0.0 1.0 0.0 1.0 0.0 48 4 1.0 3.0 0.0 ... 1 2 6798.0 2517 4281.0 0.873 4327 79 0.881 0.370
407 0.0 1.0 0.0 1.0 0.0 41 2 3.0 2.0 1.0 ... 2 0 27000.0 0 27000.0 0.610 1209 39 0.300 0.000
8728 0.0 1.0 1.0 0.0 0.0 46 4 1.0 1.0 1.0 ... 2 3 15034.0 1356 13678.0 0.754 7737 84 0.750 0.090

5 rows × 22 columns

Let us confirm that the dataset no longer has non-numeric columns and proceed to model building.

In [69]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6075 entries, 800 to 4035
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Gender_F                  6075 non-null   float64
 1   Gender_M                  6075 non-null   float64
 2   Marital_Status_Divorced   6075 non-null   float64
 3   Marital_Status_Married    6075 non-null   float64
 4   Marital_Status_Single     6075 non-null   float64
 5   Customer_Age              6075 non-null   int64  
 6   Dependent_count           6075 non-null   int64  
 7   Education_Level           6075 non-null   float64
 8   Income_Category           6075 non-null   float64
 9   Card_Category             6075 non-null   float64
 10  Months_on_book            6075 non-null   int64  
 11  Total_Relationship_Count  6075 non-null   int64  
 12  Months_Inactive_12_mon    6075 non-null   int64  
 13  Contacts_Count_12_mon     6075 non-null   int64  
 14  Credit_Limit              6075 non-null   float64
 15  Total_Revolving_Bal       6075 non-null   int64  
 16  Avg_Open_To_Buy           6075 non-null   float64
 17  Total_Amt_Chng_Q4_Q1      6075 non-null   float64
 18  Total_Trans_Amt           6075 non-null   int64  
 19  Total_Trans_Ct            6075 non-null   int64  
 20  Total_Ct_Chng_Q4_Q1       6075 non-null   float64
 21  Avg_Utilization_Ratio     6075 non-null   float64
dtypes: float64(13), int64(9)
memory usage: 1.1 MB

Model Building¶

Model Evaluation

Model evaluation criterion

Model can make wrong predictions as:

Predicting a customer would close their credit card account and the customer doesn't - Loss of resources

Predicting a customer will not close their credit card account and the customer does - Loss of opportunity

Which case is more important? Predicting a customer will not close their credit card account when the customer actually does close the card. This means this customer will not be targeted by the business team in an effort to retain and hence the bank could lose an existing customer.

How to reduce this loss i.e reduce False Negatives? The company would want the recall to be maximized. Greater the recall score, higher are the chances of minimizing the False Negatives.

BottomLine - We want the False Negatives to be at a minimum.

Class Weightage

Since the frequency of class 0 (existing customers) is 84% and the frequency of class 1 (attrited customers) is %16, class 0 will become the dominant class and the algorithms will become biased towards the dominant classes. To handle this imbalanced data set, we can pass a dictionary {0:0.16,1:0.84} to the model to specify the weight of each class and the algorithms will give more weightage to class 1.

Candidates for the 6 Classification Models

Logistic Regression

Decision Tree

2 Bagging Classification Algorithms

Bagging

Random Forest

2 Boosting Classification Algorithms

Gradient Boosting

Xtreme Gradient Boosting

Hyperparameter Tuning

Let's start by building the models using KFold and cross_val_score and tune the top 3 models using RandomizedSearchCV.

Stratified K-Folds cross-validation provides dataset indices to split data into train/validation sets. Split dataset into k consecutive folds (without shuffling by default) keeping the distribution of both classes in each fold the same as the target variable. Each fold is then used once as validation while the k - 1 remaining folds form the training set.

Model Evaluation Functions - Scoring & Confusion Matrix¶

In [70]:
#Function to draw the confusion matrix
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [71]:
##  Function to calculate different metric scores of the model - Accuracy, Recall, Precision and F1
def get_metrics_score(model,X,y):
    '''
    model : classifier to predict values of X

    '''
    # defining an empty list to store training, validation and test results
    score_list=[] 
    
    pred_X = model.predict(X)
    pred_val = model.predict(X_val)
    pred_test = model.predict(X_test)
    
    X_acc = model.score(X,y)
    val_acc = model.score(X_val,y_val)
    test_acc = model.score(X_test,y_test)
    
    X_recall = recall_score(y,pred_X)
    val_recall = recall_score(y_val,pred_val)
    test_recall = recall_score(y_test,pred_test)
    
    X_precision = precision_score(y,pred_X)
    val_precision = precision_score(y_val,pred_val)
    test_precision = precision_score(y_test,pred_test)

    X_f1 = f1_score(y,pred_X)
    val_f1 = f1_score(y_val,pred_val)
    test_f1 = f1_score(y_test,pred_test)
    
    
    score_list.extend((X_acc,test_acc,X_recall,test_recall,X_precision,test_precision,X_f1,test_f1))
        
     
    print("Accuracy on training set : ",X_acc)
    print("Accuracy on validation set : ",val_acc)
    print("Accuracy on test set : ",test_acc)
    print("Recall on training set : ",X_recall)
    print("Recall on validation set : ",val_recall)
    print("Recall on test set : ",test_recall)
    print("Precision on training set : ",X_precision)
    print("Precision on validation set : ",val_precision)
    print("Precision on test set : ",test_precision)
    print("F1 on training set : ",X_f1)
    print("F1 on validation set : ",val_f1)
    print("F1 on test set : ",test_f1)
    
    return score_list # returning the list with train and test scores
In [72]:
# Defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []

def add_score_model(score):
    '''Add scores to list so we can compare metrics generated by all the models'''   
    acc_train.append(score[0])
    acc_test.append(score[1])
    recall_train.append(score[2])
    recall_test.append(score[3])
    precision_train.append(score[4])
    precision_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])

Function to

Build the 6 models using the data supplied

Fits the data on the model and calculates the cross validation scores by splitting the dataset into 5 folds

Plots a box plot based on the cv scores calculated

In [73]:
def build_models_calculate_cv_metrics(X,y,title_for_boxplots):
  
  models = []  # Empty list to store all the models

  # Appending models into the list
  models.append(("LogisticRegression",LogisticRegression(random_state=1)))
  models.append(("DecisionTree",DecisionTreeClassifier(random_state=1)))
  models.append(("Bagging", BaggingClassifier(random_state=1)))
  models.append(("Random forest", RandomForestClassifier(random_state=1)))
  models.append(("GBM", GradientBoostingClassifier(random_state=1)))
  models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))

  cv_scores = []  # Empty list to store all model's CV scores
  model_names = []  # Empty list to store name of the models


  # loop through all models to get the mean cross validated score
  print("\n" "Cross-Validation Performance:" "\n")

  for name, model in models:
      scoring = "recall"
      kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # Setting number of splits equal to 5
      cv_result = cross_val_score(
      estimator=model, X=X, y=y, scoring=scoring, cv=kfold
    )
      cv_scores.append(cv_result)
      model_names.append(name)
      print("{}: {}".format(name, cv_result.mean()))

  print("\n" "Training Performance:" "\n")

  for name, model in models:
    model.fit(X, y)
    scores = recall_score(y, model.predict(X))
    print("{}: {}".format(name, scores))

  print("\n" "Validation Performance:" "\n")

  for name, model in models:
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))

  
  plot_cv_scores(cv_scores, model_names, title_for_boxplots)
In [74]:
# Plotting boxplots based on cross validation scores
def plot_cv_scores(cv_scores, model_names, title):

  fig = plt.figure(figsize=(15, 7))

  fig.suptitle(title)
  ax = fig.add_subplot(111)

  plt.boxplot(cv_scores)
  ax.set_xticklabels(model_names)

  plt.show()

Building the 6 Basic/Default Models and generating BoxPlots using the CV Scores¶

In [75]:
build_models_calculate_cv_metrics(X_train,y_train,"CV Scores of Basic Models")
Cross-Validation Performance:

LogisticRegression: 0.40563579277864986
DecisionTree: 0.7817634746206175
Bagging: 0.7868602825745682
Random forest: 0.7694296180010466
GBM: 0.8175719518576663
Xgboost: 0.8555102040816326

Training Performance:

LogisticRegression: 0.4098360655737705
DecisionTree: 1.0
Bagging: 0.9877049180327869
Random forest: 1.0
GBM: 0.8811475409836066
Xgboost: 1.0

Validation Performance:

LogisticRegression: 0.46932515337423314
DecisionTree: 0.8282208588957055
Bagging: 0.8128834355828221
Random forest: 0.8128834355828221
GBM: 0.8588957055214724
Xgboost: 0.8773006134969326

Observation - Basic Models

XGBoost, GradientBoost and Bagging models seem to generate the best CV scores with XGBoost leading the way.

Oversampling training data using SMOTE¶

In [76]:
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)  
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))

print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099 

After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099 

After Oversampling, the shape of train_X: (10198, 22)
After Oversampling, the shape of train_y: (10198,) 

Building the same 6 Models using Oversampled data and generating BoxPlots using the CV Scores.¶

In [77]:
build_models_calculate_cv_metrics(X_train_over,y_train_over,"CV Scores of OverSampled Models")
Cross-Validation Performance:

LogisticRegression: 0.7995741692162636
DecisionTree: 0.9517566241413149
Bagging: 0.9666618561065251
Random forest: 0.979212607516019
GBM: 0.9747010717927995
Xgboost: 0.9839182974465548

Training Performance:

LogisticRegression: 0.8183957638752697
DecisionTree: 1.0
Bagging: 0.9976465973720338
Random forest: 1.0
GBM: 0.9792116101196313
Xgboost: 1.0

Validation Performance:

LogisticRegression: 0.8343558282208589
DecisionTree: 0.8650306748466258
Bagging: 0.8466257668711656
Random forest: 0.8650306748466258
GBM: 0.911042944785276
Xgboost: 0.9141104294478528

Observation - Models built on OverSampled data

The cv scores of all the models improved. The recall metric against the validation set too has gained from oversampling the data.

XGBoost, GradientBoost and RandomForest are the top performers with XGBoost continuing to be the leader.

Undersampling training data using Random Under Sampler¶

In [78]:
undersampler = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)

Building the same 6 Models using UnderSampled data and generating BoxPlots using the CV Scores.¶

In [79]:
build_models_calculate_cv_metrics(X_train_under,y_train_under,"CV Scores of UnderSampled Models")
Cross-Validation Performance:

LogisticRegression: 0.8145421245421245
DecisionTree: 0.888299319727891
Bagging: 0.9088173731030873
Random forest: 0.9375039246467818
GBM: 0.9416012558869703
Xgboost: 0.9518524332810048

Training Performance:

LogisticRegression: 0.8125
DecisionTree: 1.0
Bagging: 0.9907786885245902
Random forest: 1.0
GBM: 0.9795081967213115
Xgboost: 1.0

Validation Performance:

LogisticRegression: 0.8128834355828221
DecisionTree: 0.9079754601226994
Bagging: 0.9325153374233128
Random forest: 0.9325153374233128
GBM: 0.9601226993865031
Xgboost: 0.9662576687116564

Observation - Models built on UnderSampled data

The cv scores of the models have improved when compared to the scores of the basic models.

XGBoost, GradientBoost and RandomForest are the leaders and XGBoost retains the top place.

The top 3 Models targeted for Tuning

Overall, sampling (both Over and Under) has improved the recall metrics across all the 6 models in both the training as well as validation datasets.

  1. XGBoost consistently produced high metrics - Of the 3 XGBoost models, the one built using OverSampled data yeilded the best cv scores. The recall metric on the validation dataset looks good as well.
  2. GradingBoosting consistently produced high metrics and has a very small range of cv values. GradientBoost built using OverSampled data resulted in the highest cv scores. The recall metric on the validation dataset looks good as well.
  3. RandomForest generated improved metrics both with oversampling as well as undersampling. RandomForest built using OverSampled data returned the highest cv scores. The recall metric on the validation dataset looks good as well.

Hyperparameter Tuning using RandomizedSearchCV¶

Extreme Gradient Boosting Tuned using OverSampled Data¶

In [80]:
%%time
# Choose the type of classifier. 
xgb_tuned = XGBClassifier(random_state=1,eval_metric='logloss')

# Grid of parameters to choose from
parameters = {'n_estimators':np.arange(50,150,50),
            'scale_pos_weight':[2,5,10],
            'learning_rate':[0.009,0.01,0.1,0.2,0.05],
            'gamma':[0,1,3,5],
            'subsample':[0.8,0.9,1],
            'max_depth':np.arange(1,5,1),
            'reg_lambda':[5,10]}

# Run the randomized search
grid_obj = RandomizedSearchCV(estimator=xgb_tuned, param_distributions=parameters, n_iter=50, scoring='recall', cv=5, random_state=1, n_jobs=-1, verbose=2)
grid_obj = grid_obj.fit(X_train_over, y_train_over)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}:" .format(grid_obj.best_params_,grid_obj.best_score_))
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'reg_lambda': 10, 'n_estimators': 50, 'max_depth': 2, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9982352941176471:
CPU times: total: 4.61 s
Wall time: 58.8 s

Generate the metrics on training, validation and test data sets

In [81]:
xgb_tuned_score=get_metrics_score(xgb_tuned, X_train_over, y_train_over)
add_score_model(xgb_tuned_score)
Accuracy on training set :  0.789468523239851
Accuracy on validation set :  0.6609081934846989
Accuracy on test set :  0.6436327739387957
Recall on training set :  0.9994116493430084
Recall on validation set :  1.0
Recall on test set :  1.0
Precision on training set :  0.7038674033149172
Precision on validation set :  0.32181638696939785
Precision on test set :  0.3104106972301815
F1 on training set :  0.825998865386174
F1 on validation set :  0.48693054518297235
F1 on test set :  0.47376093294460647
In [82]:
confusion_matrix_sklearn(xgb_tuned, X_train_over, y_train_over)
In [83]:
confusion_matrix_sklearn(xgb_tuned, X_val, y_val)
In [84]:
confusion_matrix_sklearn(xgb_tuned, X_test, y_test)

Gradient Boosting Tuned using OverSampled Data¶

In [85]:
%%time
# Choose the type of classifier. 
gb_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)

# Grid of parameters to choose from
params = {'n_estimators':[100,150,200],
            'max_features':[0.7,0.8,0.9,1],
            'subsample':[0.8,0.9,1],
            'max_depth':[3,4,5,6]
            }

# Run the randomized search
randomized_search = RandomizedSearchCV(estimator=gb_tuned, param_distributions=params, n_iter=20, scoring='recall', cv=3, random_state=1, n_jobs=-1, verbose=2)
randomized_search = randomized_search.fit(X_train_over, y_train_over)

# Set the clf to the best combination of parameters
gb_tuned = randomized_search.best_estimator_

# Fit the best algorithm to the data.
gb_tuned.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}:" .format(grid_obj.best_params_,grid_obj.best_score_))
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'reg_lambda': 10, 'n_estimators': 50, 'max_depth': 2, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9982352941176471:
CPU times: total: 9.14 s
Wall time: 1min 6s

Generate the metrics on training, validation and test data sets

In [86]:
gb_tuned_score=get_metrics_score(gb_tuned, X_train_over, y_train_over)
add_score_model(gb_tuned_score)
Accuracy on training set :  0.9984310649146891
Accuracy on validation set :  0.9669299111549852
Accuracy on test set :  0.9703849950641659
Recall on training set :  0.9990194155716807
Recall on validation set :  0.9049079754601227
Recall on test set :  0.9384615384615385
Precision on training set :  0.997845249755142
Precision on validation set :  0.8912386706948641
Precision on test set :  0.8840579710144928
F1 on training set :  0.9984319874558996
F1 on validation set :  0.8980213089802132
F1 on test set :  0.9104477611940298
In [87]:
confusion_matrix_sklearn(gb_tuned, X_train_over, y_train_over)
In [88]:
confusion_matrix_sklearn(gb_tuned, X_val, y_val)
In [89]:
confusion_matrix_sklearn(gb_tuned, X_test, y_test)

Random Forest Tuned with OverSampled Data¶

In [90]:
%%time
# Choose the type of classifier. 
randomforest_tuned = RandomForestClassifier(class_weight={0:0.16,1:0.84},random_state=1)

parameters = {  
    'max_depth': list(np.arange(10,25,5)) + [None],
    'max_features': ['sqrt','log2',None],
    'min_samples_leaf': np.arange(1,15,5),
    'min_samples_split': np.arange(2, 20, 5),
    'n_estimators': np.arange(10,30,10)}


# Run the randomized search
grid = RandomizedSearchCV(estimator=randomforest_tuned, param_distributions=parameters, n_iter=50, scoring='recall', cv=5, random_state=1, n_jobs=-1, verbose=2)
grid = grid.fit(X_train_over, y_train_over)

# Set the clf to the best combination of parameters
randomforest_tuned = grid.best_estimator_

# Fit the best algorithm to the data. 
randomforest_tuned.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(grid.best_params_,grid_obj.best_score_))
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters are {'n_estimators': 20, 'min_samples_split': 2, 'min_samples_leaf': 11, 'max_features': 'log2', 'max_depth': 20} with CV score=0.9982352941176471:
CPU times: total: 2.06 s
Wall time: 33.6 s
In [91]:
randomforest_tuned_score=get_metrics_score(randomforest_tuned, X_train_over, y_train_over)
add_score_model(randomforest_tuned_score)
Accuracy on training set :  0.9541086487546577
Accuracy on validation set :  0.9032576505429417
Accuracy on test set :  0.9106614017769002
Recall on training set :  0.9992155324573446
Recall on validation set :  0.9294478527607362
Recall on test set :  0.9661538461538461
Precision on training set :  0.9165317503148048
Precision on validation set :  0.6365546218487395
Precision on test set :  0.6487603305785123
F1 on training set :  0.9560893225745919
F1 on validation set :  0.7556109725685786
F1 on test set :  0.7762669962917181
In [92]:
confusion_matrix_sklearn(randomforest_tuned, X_train_over, y_train_over)
In [93]:
confusion_matrix_sklearn(randomforest_tuned, X_val, y_val)
In [94]:
confusion_matrix_sklearn(randomforest_tuned, X_test, y_test)

Model Metric Matrix¶

In [95]:
comparison_frame = pd.DataFrame({'Model':['ExtremeGradientBoosting Tuned - OverSampled',
                                          'GradientBoosting Tuned - OverSampled',
                                          'RandomForest Tuned - OverSampled'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train, 'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_F1':f1_train, 'Test_F1':f1_test}) 

for col in comparison_frame.select_dtypes(include='float').columns.to_list():
    comparison_frame[col] = (round(comparison_frame[col] * 100, 0)).astype(int)
    
#Sorting models in decreasing order of test recall metric
comparison_frame.sort_values(by='Test_Recall',ascending=False)
Out[95]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1
0 ExtremeGradientBoosting Tuned - OverSampled 79 64 100 100 70 31 83 47
2 RandomForest Tuned - OverSampled 95 91 100 97 92 65 96 78
1 GradientBoosting Tuned - OverSampled 100 97 100 94 100 88 100 91

Conclusion drawn from Model Metric Matrix

The Tuned Extreme Gradient Boosting Model gave us the best recall metrics - A complete 100 both on the training as well as the testing datasets. However it returned sub-optimal metrics for Accuracy, Precision and F1. GradientBoosting on the other hand, gave us a recall metric of 94 on the test data along with very high metrics across Accuracy(97), Precisision(88) and F1(91). So making GradientBoosting Tuned model using OverSampled data as our final/production model.

Importance of features as rated by the Gradient Boost Model¶

In [96]:
feature_names = list(X_train_over.columns)
importances = gb_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

The total count of transacations is #1, followed by total transaction amount, total revolving monthly balance, then the total products count of the customer with the bank are the important features in predicting attrition.

In [97]:
# create a scatterplot
sns.scatterplot(x=data['Total_Trans_Ct'], y=data['Attrition_Flag'])
plt.show()
In [98]:
# create a scatterplot
sns.scatterplot(x=data['Total_Trans_Amt'], y=data['Attrition_Flag'])
plt.show()
In [99]:
# create a scatterplot
sns.scatterplot(x=data['Total_Revolving_Bal'], y=data['Attrition_Flag'])
plt.show()
In [100]:
# crosstabs
pd.crosstab(index=data["Attrition_Flag"], columns=data["Total_Relationship_Count"])
Out[100]:
Total_Relationship_Count 1 2 3 4 5 6
Attrition_Flag
1 233 346 400 225 227 196
0 677 897 1905 1687 1664 1670

Pipelines for productionizing the model¶

Pipeline is a means of automating the machine learning workflow by enabling data to be transformed and correlated into a model that can then be analyzed to achieve outputs. This type of ML pipeline makes the process of inputting data into the ML model fully automated.

Now, we have a final model. Let's use pipelines to put the model into production.

Column Transformer We know that we can use pipelines to standardize the model building, but the steps in a pipeline are applied to each and every variable - how can we personalize the pipeline to perform different processing on different columns? Column transformer allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

In [101]:
#Encoding the target variable
pipeline_data= csv_data.copy()
pipeline_data['Attrition_Flag'] = pipeline_data['Attrition_Flag'].replace({"Attrited Customer": 1, "Existing Customer": 0})
In [102]:
# Separating target variable from the independent features
X_prod = pipeline_data.drop(columns="Attrition_Flag")
y_prod = pipeline_data["Attrition_Flag"]

Now we already know the best model we need to process with, so we don't need to divide data into 3 sets - train, validation and test

In [103]:
# Splitting the data into train and test sets
X_prod_train, X_prod_test, y_prod_train, y_prod_test = train_test_split(
    X_prod, y_prod, test_size=0.30, random_state=1, stratify=y
)
print(X_prod_train.shape, X_prod_test.shape)
(7088, 20) (3039, 20)

Verify the ratio of the class values in the target variable (84:16) is the same between the training and test datasets.

In [104]:
y_prod_train.value_counts(1)
Out[104]:
0    0.839306
1    0.160694
Name: Attrition_Flag, dtype: float64
In [105]:
y_prod_test.value_counts(1)
Out[105]:
0    0.839421
1    0.160579
Name: Attrition_Flag, dtype: float64

Steps that are involved in prepepping the data for model training

  • Erroneous data in the Income_Category column needs to be rectified.
  • Missing values in the Marital_Status and Education_Level need to be imputed.
  • Education_Level, Card_Category and Income_Category need to go through oridnal encoding.
  • Gender and Marital_Status columns need to go through OneHot Encoding.

Building multiple transformers to take care of these steps.

Custom Transformer

Building a custom transformer to

  • Drop the CLIENTNUM column which has all unique values and does not help with the model prediction
  • Replace the errorneous data in the Income_Category column with null values, so they could be imputed.
  • Replace the missing values in the numeric columns with mean values
  • Replace the missing values in the non-numeric columns with the most-frequent value.
In [106]:
#ColumnTransformer returns numpy.array, so it can't have column attribute and so we cannot
#use an OrdinalEncoder after we have passed the data through a SimpleImputer. Hence avoiding
#using SimpleImputer and doing the imputing in the custom transformer

def myProcessingSteps(X_prod_train):
    #Dropping the column with all unique values
    X_prod_train.drop(columns=["CLIENTNUM"], inplace=True)

    # Replacing the values in Education column
    X_prod_train['Income_Category'].replace(to_replace="abc", value=np.NaN, inplace=True)

    
    # Imputing numeric columns
    for col in X_prod_train.select_dtypes(include=['number']):
      X_prod_train[col].fillna(X_prod_train[col].mean(), inplace=True)

    # Imputing categorical columns
    for col in X_prod_train.select_dtypes(include=['object']):
      X_prod_train[col].fillna(X_prod_train[col].mode()[0], inplace=True)
    
    return X_prod_train

Create an instance of the Custom Transformer

In [107]:
# The function created for processing the data should be passed as an arugument in the FunctionTransformer
custom_transformer = FunctionTransformer(myProcessingSteps)

Transformer to handle Ordinal Ecoding and OneHot Encoding

In [108]:
#ColumnTransformer returns numpy.array, so it can't have column attribute and so we cannot
#use an OrdinalEncoder after we have passed the data through a SimpleImputer. Hence avoiding
#using SimpleImputer and imputing within the custom transformer

#Create categories to instruct how ordinal encoder should work
education_order = ['Uneducated','High School','College','Graduate','Post-Graduate','Doctorate']
card_order = ["Blue", "Silver", "Gold", 'Platinum']
income_order = ['Less than $40K','$40K - $60K', '$60K - $80K','$80K - $120K', '$120K +']

#2nd Ordinal Encoding and One Hot Encoding
ordinal_ohe_transformer = ColumnTransformer([
        ('encode_education_level', OrdinalEncoder(categories=[education_order],handle_unknown='error'),['Education_Level']),
        ('encode_card_category', OrdinalEncoder(categories=[card_order],handle_unknown='error'),['Card_Category']),
        ('encode_income_category', OrdinalEncoder(categories=[income_order],handle_unknown='error'),['Income_Category']),
        ('ohe_transformer', OneHotEncoder(sparse=False, handle_unknown='ignore'),['Gender','Marital_Status'])
    ], remainder='passthrough',sparse_threshold=0)

Generating Oversampled Data

In [109]:
smote = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)

Setting the best model, Gradient Boosting in our case.

In [110]:
model = gb_tuned

Create the pipeline and pass the transformers in the right order

In [111]:
pipe = ImbPipeline([
    ('custom_transformer', custom_transformer),
    ('ordinal_ohe_transformer', ordinal_ohe_transformer),
    ('smote', smote),
    ('model', model)
])
In [112]:
# Display the hierarchy of the transformers in the Pipeline
from sklearn import set_config
set_config(display='diagram')
In [113]:
#Fit the pipeline on training data
pipe.fit(X_prod_train, y_prod_train)
Out[113]:
Pipeline(steps=[('custom_transformer',
                 FunctionTransformer(func=<function myProcessingSteps at 0x000001D0E9F0ED30>)),
                ('ordinal_ohe_transformer',
                 ColumnTransformer(remainder='passthrough', sparse_threshold=0,
                                   transformers=[('encode_education_level',
                                                  OrdinalEncoder(categories=[['Uneducated',
                                                                              'High '
                                                                              'School',
                                                                              'College',
                                                                              'Graduate',
                                                                              'Post-Graduate',
                                                                              'Doctorate']]),
                                                  [...
                                                                              '$80K',
                                                                              '$80K '
                                                                              '- '
                                                                              '$120K',
                                                                              '$120K '
                                                                              '+']]),
                                                  ['Income_Category']),
                                                 ('ohe_transformer',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  ['Gender',
                                                   'Marital_Status'])])),
                ('smote', SMOTE(random_state=1, sampling_strategy=1)),
                ('model',
                 GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                                            max_depth=6, max_features=0.7,
                                            random_state=1, subsample=0.8))])
Please rerun this cell to show the HTML repr or trust the notebook.
Pipeline(steps=[('custom_transformer',
                 FunctionTransformer(func=<function myProcessingSteps at 0x000001D0E9F0ED30>)),
                ('ordinal_ohe_transformer',
                 ColumnTransformer(remainder='passthrough', sparse_threshold=0,
                                   transformers=[('encode_education_level',
                                                  OrdinalEncoder(categories=[['Uneducated',
                                                                              'High '
                                                                              'School',
                                                                              'College',
                                                                              'Graduate',
                                                                              'Post-Graduate',
                                                                              'Doctorate']]),
                                                  [...
                                                                              '$80K',
                                                                              '$80K '
                                                                              '- '
                                                                              '$120K',
                                                                              '$120K '
                                                                              '+']]),
                                                  ['Income_Category']),
                                                 ('ohe_transformer',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse=False),
                                                  ['Gender',
                                                   'Marital_Status'])])),
                ('smote', SMOTE(random_state=1, sampling_strategy=1)),
                ('model',
                 GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                                            max_depth=6, max_features=0.7,
                                            random_state=1, subsample=0.8))])
FunctionTransformer(func=<function myProcessingSteps at 0x000001D0E9F0ED30>)
ColumnTransformer(remainder='passthrough', sparse_threshold=0,
                  transformers=[('encode_education_level',
                                 OrdinalEncoder(categories=[['Uneducated',
                                                             'High School',
                                                             'College',
                                                             'Graduate',
                                                             'Post-Graduate',
                                                             'Doctorate']]),
                                 ['Education_Level']),
                                ('encode_card_category',
                                 OrdinalEncoder(categories=[['Blue', 'Silver',
                                                             'Gold',
                                                             'Platinum']]),
                                 ['Card_Category']),
                                ('encode_income_category',
                                 OrdinalEncoder(categories=[['Less than $40K',
                                                             '$40K - $60K',
                                                             '$60K - $80K',
                                                             '$80K - $120K',
                                                             '$120K +']]),
                                 ['Income_Category']),
                                ('ohe_transformer',
                                 OneHotEncoder(handle_unknown='ignore',
                                               sparse=False),
                                 ['Gender', 'Marital_Status'])])
['Education_Level']
OrdinalEncoder(categories=[['Uneducated', 'High School', 'College', 'Graduate',
                            'Post-Graduate', 'Doctorate']])
['Card_Category']
OrdinalEncoder(categories=[['Blue', 'Silver', 'Gold', 'Platinum']])
['Income_Category']
OrdinalEncoder(categories=[['Less than $40K', '$40K - $60K', '$60K - $80K',
                            '$80K - $120K', '$120K +']])
['Gender', 'Marital_Status']
OneHotEncoder(handle_unknown='ignore', sparse=False)
['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
passthrough
SMOTE(random_state=1, sampling_strategy=1)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1), max_depth=6,
                           max_features=0.7, random_state=1, subsample=0.8)
AdaBoostClassifier(random_state=1)
In [114]:
#Calculate the recall store
recall_score(y_prod_test, pipe.predict(X_prod_test))
Out[114]:
0.9221311475409836

We successfully built a Pipeline using the Gradient Boosting Model. This concludes the building and validation of the Pipeline.

Conclusion¶

The bank can run the model to achieve desired performance levels on new data, and also gain insights into the customer profiles of existing customers.

Actionable Insights¶

Data Capture Enhancements

  1. There was imbalance in data, it contains only 16% attrited customers. This could be rectified before any future analysis.

  2. The data capture issues resulted in Missing values in Marital Status and Education_Level, which are imputed with most frequent values in dataset. This adds to imbalance in the data. This should be rectified.

  3. "abc" Category in Income_Level looks like typo for NULL Values and were also imputed using frequent values further creating one large segment of customers earning below $40 K.

  4. Capturing additional information related to Customer Employment, Living Status and Spending Pattern will help in smart imputation.

Feature Importance

We observed in our analysis that the total tranaction count is the #1 feature of importance, followed by the total tranaction amount, month-to-month balance, then the total amount of products a customer holds with the bank in predicting whether a customer will attrit. The company should carefully consider these features for monitoring and setting thresholds for raising alerts.

Attrition Indicators Company should target customers who attrit based on their historical customer profile from this dataset (as a starting point).

The Attrited customers:

are primarily in the age group of 41 to 52

have been with the bank between 2.5 to 3.5 years

have not used the credit card for 2 to 3 months in the last 12 months

have a credit limit between ~2K and ~10K

have a revolving balance between 0 and 1.3K

spend anywhere between ~1.9K to ~2.7K annually

use the credit card between 37 to 51 times annually

majority of them are married female graduates with income less than 40K.

Recommendations¶

Card categories other than Blue could be promoted more.

Business can also tailor offers towards Female Customers to reduce attrition.

Business can focus on customer using Revolving Credit and curate specific offers for them to encourage loyalty to the bank.

Customers with Income more that $120K should be targeted to enhance their satisfaction and reduce attrition.

Customers with Doctorate degrees contribute more to the attrited customers. Business can use the Customer satisfaction survey data, implement findings and improve the retention rates.

Specific promotional and rewards programs can be created for different income groups.

Business has to also focus on customers that have not used their cards for more than 3 months in last one year. There are chances that these customers may close the card accounts.